# SICSS Norrköping 2025: Network Analysis (SOLUTIONS)

Author

Alexandra Rottenkolber, updated by Vsevolod Suschevskiy

Published

June 12, 2025

1 Introduction & getting started

Welcome to the practical part of the network analysis day! In this session, you will get an introduction to how to represent a network in R (or Python, see the other script), how to calculate basic network descriptives (on the micro, meso, and macro level), and how to put on the “network thinking hat”. You are invited to work through the exercise sheet on your own or in groups, with help from me and Carl where needed. If you are already familiar with some aspects of the lab, feel free to skip certain sections or go straight to the exercise part.

Learning goals for this workshop

  1. Introduction
  • Represent a network in R (with IGRAPH)
  • Create a graph from scratch (add nodes, edges, attributes)
  • Visualise it
  2. Basic network descriptives
  • macro-level descriptives
  • meso-level descriptives
  • micro-level descriptives
  3. Network thinking
  • Majority illusion
  • Friendship paradox

1.1 Introduction

Getting started. First, we need to install the most important packages for network analysis in R.
There are plenty of R packages for network analysis available, which are tailored to perform different types of analysis. The most important ones for the beginning are

  • “sna”
  • “igraph”
  • “network”
  • “ggraph” for visualisations (uses ggplot2)
  • “egor” for ego-nets (uses igraph objects)
  • “netseg” for assortativity patterns, homophily, segregation, within-group mixing, etc.
  • “isnar”
  • “networkdata” a large sample of networks in igraph format (to install network data, see: https://github.com/schochastics/networkdata)
  • “igraphdata” several network datasets
  • “tidygraph” for tidy manipulation of igraph objects
  • etc.

To install a package, simply pass its name to install.packages(), e.g. install.packages(c('sna','igraph')) or install.packages("here").

1.2 Creating a graph object

The most widely used R packages for (simple) network analysis are called sna and igraph. To create a graph object, SNA works with matrices as inputs, whereas IGRAPH requires its own igraph objects. You can find information on the basic IGRAPH data types here.

# In SNA, the basic element is the MATRIX
mtx <- matrix(c(0, 1, 0, 0, 0, 0, 0,
                1, 0, 1, 1, 1, 0, 0,
                1, 1, 0, 0, 0, 0, 0,
                0, 1, 0, 0, 1, 0, 0,
                0, 1, 0, 1, 0, 1, 1,
                0, 0, 0, 0, 1, 0, 1,
                0, 0, 0, 0, 1, 1, 0),
              nrow=7,
              ncol=7,
              byrow=TRUE)
mtx
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    0    1    0    0    0    0    0
[2,]    1    0    1    1    1    0    0
[3,]    1    1    0    0    0    0    0
[4,]    0    1    0    0    1    0    0
[5,]    0    1    0    1    0    1    1
[6,]    0    0    0    0    1    0    1
[7,]    0    0    0    0    1    1    0
class(mtx) # An array is an n-dimensional object
[1] "matrix" "array" 
dim(mtx) # Shows dimensions of the matrix. Here, we have a 7x7 matrix.
[1] 7 7
# Usually, we would like to label the dimensions
dimnames(mtx) <- list(c('Kira','Amaya','Rohan','Robin','Hanna','Adam','Igor'), c('Kira','Amaya','Rohan','Robin','Hanna','Adam','Igor'))
mtx
      Kira Amaya Rohan Robin Hanna Adam Igor
Kira     0     1     0     0     0    0    0
Amaya    1     0     1     1     1    0    0
Rohan    1     1     0     0     0    0    0
Robin    0     1     0     0     1    0    0
Hanna    0     1     0     1     0    1    1
Adam     0     0     0     0     1    0    1
Igor     0     0     0     0     1    1    0
# We can access each element of a matrix separately
mtx[1,2] # first row, second column
[1] 1
mtx['Adam','Igor'] # sixth row, seventh column
[1] 1
# And we can replace elements by assigning a new value
mtx[3,1] <- 0 # Assigns 0 to the element in the 3rd row, 1st column
mtx
      Kira Amaya Rohan Robin Hanna Adam Igor
Kira     0     1     0     0     0    0    0
Amaya    1     0     1     1     1    0    0
Rohan    0     1     0     0     0    0    0
Robin    0     1     0     0     1    0    0
Hanna    0     1     0     1     0    1    1
Adam     0     0     0     0     1    0    1
Igor     0     0     0     0     1    1    0
# In social network analysis, the diagonal of a matrix is often set to NA. This indicates that self-loops cannot exist by definition, rather than simply being absent from the data.
diag(mtx) # Returns the elements in positions (1,1),... (n,n)
[1] 0 0 0 0 0 0 0
diag(mtx) <- NA 
mtx
      Kira Amaya Rohan Robin Hanna Adam Igor
Kira    NA     1     0     0     0    0    0
Amaya    1    NA     1     1     1    0    0
Rohan    0     1    NA     0     0    0    0
Robin    0     1     0    NA     1    0    0
Hanna    0     1     0     1    NA    1    1
Adam     0     0     0     0     1   NA    1
Igor     0     0     0     0     1    1   NA
mtx |> 
  reshape2::melt() |> 
ggplot2::ggplot(ggplot2::aes(x=Var1,y=Var2,fill=value |> as.factor()))+
  ggplot2::geom_tile()+
  ggplot2::labs(title='Adjacency matrix', x=NULL, y=NULL)+
  ggplot2::scale_fill_manual(values=c('white','steelblue', "lightgray"), name='Contact', labels=c('No','Yes', NA))+
  ggplot2::theme_minimal()

This matrix can already be interpreted as representing a network (this format is called the network’s adjacency matrix). Say the values indicate whether a person i (in the rows) is in regular contact with another person j (in the columns). How would you interpret the network?
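Whether this matrix can be read as an undirected network depends on it being symmetric. A quick check (a sketch using the `mtx` object built above; `isSymmetric()` is base R):

```r
# Ignore the NA diagonal and the dimnames; compare only the off-diagonal entries
m <- mtx
diag(m) <- 0
isSymmetric(unname(m)) # TRUE: the edit above (mtx[3,1] <- 0) removed the only unreciprocated tie
```

If the check returned FALSE, the data would have to be treated as directed (or symmetrised first).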

1.2.1 Visualising a network

gplot(mtx,displaylabels = TRUE)

#plot.igraph(mtx) # this will throw an error 
grph <- mtx |> 
  graph_from_adjacency_matrix(mode='undirected') # create a graph object from our matrix
Warning: The `adjmatrix` argument of `graph_from_adjacency_matrix()` must be symmetric
with mode = "undirected" as of igraph 1.6.0.
ℹ Use mode = "max" to achieve the original behavior.
# Using the igraph library
grph |> 
  plot.igraph()

# Using the ggraph library
set.seed(42)
grph |> 
  ggraph() +
  geom_edge_link() +
  geom_node_label(aes(label=name)) +
  theme_void() +
  ggtitle("Network visualisation with ggraph")
Using "stress" as default layout

1.2.2 Adding nodes and edges

There are several ways to create a graph from scratch: set up an adjacency matrix as shown above, use an edge list as input, or add nodes and edges separately, as shown below.

# use an edge list as an input for an undirected graph
g_undirected <- graph_from_literal(1-2, 2-3, 2-4, 2-5, 4-5, 5-6, 5-7, 6-7) 
g_undirected
IGRAPH efc64fd UN-- 7 8 -- 
+ attr: name (v/c)
+ edges from efc64fd (vertex names):
[1] 1--2 2--3 2--4 2--5 4--5 5--6 5--7 6--7
# or start with an empty graph and add nodes and edges separately
g_undirected_2 <- make_empty_graph(directed = FALSE) # empty graph
g_undirected_2 <- g_undirected_2 |> add_vertices(2) # add nodes
plot(g_undirected_2)

# you can also use the pipe operator and do several steps at once
g_undirected_2 <- g_undirected_2 |>  
  add_vertices(7, color = "red") |>  # you can input attributes in this statement, too
  add_edges(c(2,6, 7,8, 7,9, 8,9))
plot(g_undirected_2)

# For a directed graph: 
# use (-+, +-, or ++) to indicate the arrow ends (with a +)
g_directed <- graph_from_literal(1+-2, 2-+3, 2+-4, 2++5, 4+-5, 5-+6, 5++7, 6++7)  # plus sign captures the arrow end
g_directed
IGRAPH a96c0ce DN-- 7 11 -- 
+ attr: name (v/c)
+ edges from a96c0ce (vertex names):
 [1] 2->1 2->3 2->5 4->2 5->2 5->4 5->6 5->7 6->7 7->5 7->6
plot(g_directed)

# With V() and E() the nodes (vertices) and ties (edges) can be accessed. 
V(g_directed) # nodes/vertices
+ 7/7 vertices, named, from a96c0ce:
[1] 1 2 3 4 5 6 7
E(g_directed) # ties/edges
+ 11/11 edges from a96c0ce (vertex names):
 [1] 2->1 2->3 2->5 4->2 5->2 5->4 5->6 5->7 6->7 7->5 7->6
# Let's rename the nodes
V(g_undirected)$name <- c('Kira','Amaya','Rohan', 'Robin', 'Hanna', 'Adam', 'Igor')
g_undirected
IGRAPH efc64fd UN-- 7 8 -- 
+ attr: name (v/c)
+ edges from efc64fd (vertex names):
[1] Kira --Amaya Amaya--Rohan Amaya--Robin Amaya--Hanna Robin--Hanna
[6] Hanna--Adam  Hanna--Igor  Adam --Igor 
plot(g_undirected)

# We can even add additional attributes, i.e. store information on top of the original matrix
V(g_undirected)$gender <- c('f','f','m','d','f','m','m')
g_undirected
IGRAPH efc64fd UN-- 7 8 -- 
+ attr: name (v/c), gender (v/c)
+ edges from efc64fd (vertex names):
[1] Kira --Amaya Amaya--Rohan Amaya--Robin Amaya--Hanna Robin--Hanna
[6] Hanna--Adam  Hanna--Igor  Adam --Igor 
plot(g_undirected,
     vertex.color= ifelse(V(g_undirected)$gender == "m","steelblue", NA))

vcount(g_undirected) # count vertices
[1] 7
# Attributes to edges are also possible
ecount(g_undirected) # remember, we have 8 ties
[1] 8
E(g_undirected)$strength <- c(1,5,1,1,8,1,4,5)
g_undirected
IGRAPH efc64fd UN-- 7 8 -- 
+ attr: name (v/c), gender (v/c), strength (e/n)
+ edges from efc64fd (vertex names):
[1] Kira --Amaya Amaya--Rohan Amaya--Robin Amaya--Hanna Robin--Hanna
[6] Hanna--Adam  Hanna--Igor  Adam --Igor 
plot(g_undirected,
     vertex.color=  case_when(
       V(g_undirected)$gender == "m" ~ "steelblue",
       V(g_undirected)$gender == "f" ~ "yellow",
       V(g_undirected)$gender == "d" ~ "lightgreen"), 
     edge.width=E(g_undirected)$strength)

For the next few sections, we will use Zachary’s Karate Club graph as an example network. Zachary’s Karate Club is a very famous network in social network analysis; it stems from a 1977 paper by W. Zachary and was later popularised by the community-detection literature.

In his study, Zachary observed the friendship ties of members of a university’s karate club over the duration of two years. During this time period, a disagreement occurred between the administrator of the club and the club’s instructor, which led to the instructor leaving the club. He eventually founded a new club taking half of the original club’s members with him. Based on the structure of the friendship network, Zachary was able to predict almost exactly which of the two clubs people would join after the split.

This network is so famous that it can be pulled from IGRAPHDATA. You can simply call it by invoking data(karate).

2 Basic network descriptives

2.1 Macro level (global level):

Summary statistics, such as
  • size, average degree, and degree distribution
  • average clustering
  • transitivity

2.1.1 Size and degrees

data(karate) # data is loaded into a variable called karate

karate <- karate |> 
  igraph::upgrade_graph() |> 
  as_tbl_graph() # upgrade needed because this graph was created by an old(er) igraph version

plot(karate)

# number of nodes
#V(karate)
vcount(karate)
[1] 34
# number of edges
#E(karate)
ecount(karate)
[1] 78
# degrees 
#igraph::degree(karate)

# degree distribution 
hist(igraph::degree(karate))
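The average degree listed above can be computed directly; for an undirected graph it equals twice the number of edges divided by the number of nodes (a quick sketch):

```r
mean(igraph::degree(karate))         # average degree, ~4.59 here
2 * ecount(karate) / vcount(karate)  # the same number from the counts alone
```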

2.1.2 Transitivity and average clustering coefficient

Other useful descriptives at the macro level are the transitivity coefficient and the average clustering coefficient (clustering coefficients also exist at the local (node) level). Transitivity describes how many of the existing triads are actually closed. The average clustering coefficient describes – as the name already indicates – the average of all nodes’ clustering coefficients. The node-level clustering coefficient describes the fraction of possible triangles through a node that actually exist.

adj_mtx_karate <- karate |> 
  as_adjacency_matrix(sparse = igraph_opt("sparsematrices")) |> 
  as.matrix() # convert igraph object to matrix (as sna package functions take matrices as input)

sna::isolates(adj_mtx_karate) # Check for isolates
integer(0)
sna::gden(adj_mtx_karate) # Density (Number of actual ties over Number of potential ties)
[1] 0.1390374
# grecip(adj_mtx_karate,measure='edgewise') # For directed graphs one can check for reciprocity (Number of mutual ties over Number of existing ties)

sna::gtrans(adj_mtx_karate, measure='weak') # Transitivity (how many of the existing relationships are closed (triangles))
[1] 0.2556818
sna::gtrans(adj_mtx_karate, measure='strong') 
[1] 0.8465909
# Dyadic and triadic configurations
sna::dyad.census(adj_mtx_karate) # number of dyads that are mutual, asymmetric, or null
     Mut Asym Null
[1,]  78    0  483
sna::triad.census(adj_mtx_karate) # see: https://i.stack.imgur.com/9Xo0R.png 
      003 012  102 021D 021U 021C 111D 111U 030T 030C 201 120D 120U 120C 210
[1,] 3971   0 1575    0    0    0    0    0    0    0 393    0    0    0   0
     300
[1,]  45
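Both quantities discussed above can also be computed directly in igraph (a sketch; note that the global and the average local version generally differ):

```r
transitivity(karate, type = "global")   # transitivity; should match sna::gtrans(..., measure='weak') above
transitivity(karate, type = "average")  # average of the nodes' local clustering coefficients
# type = "local" returns one coefficient per node; see the `isolates`
# argument for how nodes with degree < 2 are handled
```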

2.2 Meso level (everything ‘in between’):

Characteristics of groups of nodes live at the mesoscale, such as
  • community detection
  • homophily
  • assortativity

2.2.1 Community detection

Community detection is a very large field of research and has received a lot of attention in the past. Its aim is to identify a network’s mesoscale organisation. In simple terms, a community is a group of nodes that are somewhat more related to each other than to the rest of the network. Community detection for networks is conceptually similar to data clustering in machine learning. It is helpful if one wants to find nodes that would, for example, react similarly to an external stimulus, or if one wants to visualise the meso-level organisation of a network.

There are many different algorithms for finding communities: some use a group’s internal density, the similarity to neighbours, or the idea of random walks. As the approaches differ in their internal logic, they might yield slightly different results.

IGRAPH has some community detection algorithms built-in, which you can find here, and which we will use in the following.

# CLUSTERING ALGORITHMS

# GIRVAN-NEWMAN
# Partitioning is based on edge betweenness.
gn_clustering <- karate |> 
  cluster_edge_betweenness(modularity=TRUE, membership=TRUE)
Warning in cluster_edge_betweenness(karate, modularity = TRUE, membership =
TRUE): At vendor/cigraph/src/community/edge_betweenness.c:498 : Membership
vector will be selected based on the highest modularity score.
gn_clustering
IGRAPH clustering edge betweenness, groups: 6, mod: 0.35
+ groups:
  $`1`
  [1] "Mr Hi"    "Actor 2"  "Actor 4"  "Actor 8"  "Actor 12" "Actor 13"
  [7] "Actor 18" "Actor 20" "Actor 22"
  
  $`2`
  [1] "Actor 3"  "Actor 10" "Actor 14" "Actor 29"
  
  $`3`
  [1] "Actor 5"  "Actor 6"  "Actor 7"  "Actor 11" "Actor 17"
  
  + ... omitted several groups/vertices
gn_clustering$modularity
 [1] -0.0511047394 -0.0465508517 -0.0294409775 -0.0128558310 -0.0007215007
 [6]  0.0181687000  0.0460073837  0.0582635258  0.0904031034  0.1027904275
[11]  0.1135848279  0.1203875490  0.1422668241  0.1576151122  0.1643897228
[16]  0.1765896441  0.1960045726  0.2127021608  0.2154569817  0.2289499822
[21]  0.2412998257  0.2627480744  0.2777402972  0.2908959727  0.3056539420
[26]  0.3154082570  0.3270272296  0.3366972133  0.3452990011  0.3266617942
[31]  0.3187252863  0.3145462042  0.3092895560  0.0000000000
# Communities are identified as components in the edge-pruned graph 
# The partitioning where the modularity is the highest is the one that gets chosen at the end.
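The selection rule in the comments above can be inspected directly on the clustering object (a sketch):

```r
which.max(gn_clustering$modularity)  # merge step with the highest modularity
max(gn_clustering$modularity)        # the score reported in the summary (~0.35)
length(gn_clustering)                # number of communities in the chosen partition
sizes(gn_clustering)                 # size of each community
```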

# Add a node attribute for visual color based on the original 'Faction'.
# Faction is 1 or 2. We'll map this to 'gold' and 'steelblue'.
V(karate)$node_visual_color <- ifelse(V(karate)$Faction == 1, 'gold', 'steelblue')

# Add a unique node identifier, crucial for joining data later.
# Ensure it's character type for robust joins.
V(karate)$node_identifier <- as.character(1:vcount(karate))

# WALKTRAP
# The walk trap algorithm is based on a series of short random walks. 
# Random walks are hypothesized to stay within the same community.
walk_clustering <- cluster_walktrap(karate, step=10)
walk_clustering
IGRAPH clustering walktrap, groups: 3, mod: 0.43
+ groups:
  $`1`
   [1] "Actor 9"  "Actor 10" "Actor 15" "Actor 16" "Actor 19" "Actor 21"
   [7] "Actor 23" "Actor 24" "Actor 25" "Actor 26" "Actor 27" "Actor 28"
  [13] "Actor 29" "Actor 30" "Actor 31" "Actor 32" "Actor 33" "John A"  
  
  $`2`
   [1] "Mr Hi"    "Actor 2"  "Actor 3"  "Actor 4"  "Actor 8"  "Actor 12"
   [7] "Actor 13" "Actor 14" "Actor 18" "Actor 20" "Actor 22"
  
  $`3`
  + ... omitted several groups/vertices
# MOODY-WHITE
blocks_clustering <- cohesive_blocks(karate)
blocks_clustering$blocks
[[1]]
+ 34/34 vertices, named, from 4b458a1:
 [1] Mr Hi    Actor 2  Actor 3  Actor 4  Actor 5  Actor 6  Actor 7  Actor 8 
 [9] Actor 9  Actor 10 Actor 11 Actor 12 Actor 13 Actor 14 Actor 15 Actor 16
[17] Actor 17 Actor 18 Actor 19 Actor 20 Actor 21 Actor 22 Actor 23 Actor 24
[25] Actor 25 Actor 26 Actor 27 Actor 28 Actor 29 Actor 30 Actor 31 Actor 32
[33] Actor 33 John A  

[[2]]
+ 28/34 vertices, named, from 4b458a1:
 [1] Mr Hi    Actor 2  Actor 3  Actor 4  Actor 8  Actor 9  Actor 10 Actor 13
 [9] Actor 14 Actor 15 Actor 16 Actor 18 Actor 19 Actor 20 Actor 21 Actor 22
[17] Actor 23 Actor 24 Actor 25 Actor 26 Actor 27 Actor 28 Actor 29 Actor 30
[25] Actor 31 Actor 32 Actor 33 John A  

[[3]]
+ 6/34 vertices, named, from 4b458a1:
[1] Mr Hi    Actor 5  Actor 6  Actor 7  Actor 11 Actor 17

[[4]]
+ 5/34 vertices, named, from 4b458a1:
[1] Mr Hi   Actor 2 Actor 3 Actor 4 Actor 8

[[5]]
+ 7/34 vertices, named, from 4b458a1:
[1] Mr Hi    Actor 2  Actor 3  Actor 9  Actor 31 Actor 33 John A  

[[6]]
+ 5/34 vertices, named, from 4b458a1:
[1] Mr Hi    Actor 5  Actor 6  Actor 7  Actor 11

[[7]]
+ 5/34 vertices, named, from 4b458a1:
[1] Mr Hi    Actor 2  Actor 3  Actor 4  Actor 14

[[8]]
+ 10/34 vertices, named, from 4b458a1:
 [1] Actor 3  Actor 24 Actor 25 Actor 26 Actor 28 Actor 29 Actor 30 Actor 32
 [9] Actor 33 John A  
# Other clustering algorithms that you could use:
cluster_louvain(karate)
IGRAPH clustering multi level, groups: 4, mod: 0.44
+ groups:
  $`1`
   [1] "Mr Hi"    "Actor 2"  "Actor 3"  "Actor 4"  "Actor 8"  "Actor 12"
   [7] "Actor 13" "Actor 14" "Actor 18" "Actor 20" "Actor 22"
  
  $`2`
  [1] "Actor 5"  "Actor 6"  "Actor 7"  "Actor 11" "Actor 17"
  
  $`3`
   [1] "Actor 9"  "Actor 10" "Actor 15" "Actor 16" "Actor 19" "Actor 21"
   [7] "Actor 23" "Actor 27" "Actor 30" "Actor 31" "Actor 33" "John A"  
  + ... omitted several groups/vertices
cluster_fast_greedy(karate)
IGRAPH clustering fast greedy, groups: 3, mod: 0.43
+ groups:
  $`1`
   [1] "Actor 9"  "Actor 10" "Actor 15" "Actor 16" "Actor 19" "Actor 21"
   [7] "Actor 23" "Actor 24" "Actor 25" "Actor 26" "Actor 27" "Actor 28"
  [13] "Actor 29" "Actor 30" "Actor 31" "Actor 32" "Actor 33" "John A"  
  
  $`2`
   [1] "Mr Hi"    "Actor 2"  "Actor 3"  "Actor 4"  "Actor 8"  "Actor 12"
   [7] "Actor 13" "Actor 14" "Actor 18" "Actor 20" "Actor 22"
  
  $`3`
  + ... omitted several groups/vertices
cluster_leading_eigen(karate)
IGRAPH clustering leading eigenvector, groups: 5, mod: 0.44
+ groups:
  $`1`
   [1] "Mr Hi"    "Actor 2"  "Actor 3"  "Actor 4"  "Actor 8"  "Actor 13"
   [7] "Actor 14" "Actor 18" "Actor 20" "Actor 22"
  
  $`2`
   [1] "Actor 9"  "Actor 10" "Actor 15" "Actor 16" "Actor 19" "Actor 21"
   [7] "Actor 23" "Actor 27" "Actor 30" "Actor 31" "Actor 33" "John A"  
  
  $`3`
  [1] "Actor 5"  "Actor 6"  "Actor 7"  "Actor 11" "Actor 17"
  + ... omitted several groups/vertices
#?cluster_leading_eigen

# How to extract the grouping
gn_clustering$membership
 [1] 1 1 2 1 3 3 3 1 4 2 3 1 1 2 4 4 3 1 4 1 4 1 4 5 5 5 6 5 2 6 4 4 4 4
karate |> 
  activate(nodes) |>
  mutate(gn_group = as.factor(gn_clustering$membership),
         wt_group = as.factor(walk_clustering$membership),
         fg_group = as.factor(cluster_fast_greedy(karate)$membership),
         lu_group = as.factor(cluster_louvain(karate)$membership),
         le_group = as.factor(cluster_leading_eigen(karate)$membership)) -> karate

V(karate)$fg_group
 [1] 2 2 2 2 3 3 3 2 1 1 3 2 2 2 1 1 3 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3
# Add clustering to the visualisation
set.seed(42)
set_layout <- create_layout(karate, layout = 'fr')

set_layout |> 
  pivot_longer(
    cols = c(gn_group, wt_group, fg_group, lu_group, le_group), # Columns to pivot
    names_to = "algorithm",                 # New column for algorithm names
    values_to = "cluster_membership_id"     # New column for cluster IDs for that algorithm
  ) ->pivoted_plotting_data


set_layout |> 
  ggraph()+
  geom_edge_link(alpha = 0.4, colour = 'grey60') + # Slightly darker grey for better visibility
  # Draw nodes (common to all facets)
  # Node color is from 'node_visual_color' (gold/steelblue based on Faction)
  # This aes() is evaluated against graph_layout_with_all_attrs
  geom_node_point(aes(colour = node_visual_color), size = 3.5, show.legend = FALSE) +
  scale_colour_identity() + # Because node_visual_color directly specifies the colors

  # Draw community hulls using ggforce::geom_mark_hull
  # This layer uses 'pivoted_plotting_data', which is already faceted by 'algorithm'.
  # The aes() here is evaluated against pivoted_plotting_data.
  geom_mark_hull(
    data = pivoted_plotting_data,
    aes(x = x, y = y, 
        group = cluster_membership_id, 
        fill = cluster_membership_id),
    colour = NA, # No border for the hull polygons
    alpha = 0.35, # Hull transparency
    concavity = 4, # Adjust concavity
    expand = unit(2.5, 'mm'), # Padding around groups
    show.legend = FALSE # No legend for hull fill colors (cluster IDs)
  )+
  facet_wrap(~algorithm, ncol = 3) +# Arrange in a grid, adjust ncol as needed
  labs(
    title = "Zachary Karate Club Network: Community Detection Comparison",
    subtitle = "Nodes colored by original faction. Hulls show algorithm-detected communities.",
    caption = "Layout: Fruchterman-Reingold"
  ) +
  theme_graph(
    base_family = 'sans', # Clean font
    strip_text_size = 10, # Size for facet titles (algorithm names)
    plot_margin = margin(5, 5, 5, 5) # Add some margin around the plot
  ) 

What did you observe?

Probably you found that the results are quite different. Was there even a single pair of algorithms that returned exactly the same partitions?

There are different ways to continue from this finding: One option is to evaluate the partitioning based on some quality measure (e.g. the “modularity score”, or the “Normalised Mutual Information”). Another option would be to apply “consensus clustering” – an approach inspired by an observation from machine learning which can be summarised as ‘averaging several simple models often yields better accuracy than constructing the most sophisticated model possible’. The idea is very simple: Run several clustering algorithms (also the same one many times) and average the results. However, this method should not be applied blindly: Make sure the results you are averaging stem from an internally consistent ensemble of clustering algorithms (i.e. the same “family”, meaning: all algorithms apply the same logic. For example, don’t mix flow-based (infomap and co) with density optimization algorithms, etc.).
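To quantify how similar two partitions are, igraph's `compare()` implements several such measures, including the Normalised Mutual Information (a sketch using two of the clusterings computed above):

```r
# NMI: 1 = identical partitions, 0 = statistically independent partitions
compare(gn_clustering, walk_clustering, method = "nmi")

# Other available measures: "vi", "rand", "adjusted.rand", "split.join"
compare(gn_clustering, walk_clustering, method = "adjusted.rand")
```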

2.2.2 Quantifying homophily: Assortativity coefficient

At the meso-level, we could also be interested in quantifying the extent to which homophily is driving a network’s connections. This is where the assortativity coefficient can help us. The idea behind this measure is to compare the number of ties that occur in-group to the number of ties that bridge to another group, relative to the total number of ties. The assortativity coefficient is, however, a bit more advanced than this simple illustration: it can take different group sizes and more than two groups into account. A value of 1 means perfect assortativity (more in-group than between-group ties), while -1 means perfect disassortativity (more between-group than in-group ties).

# Assortativity (for continuous (quantitative) attributes)
assortativity_degree(karate,directed=FALSE) # degree assortativity (Finding: high degree nodes tend to connect to low degree nodes)
[1] -0.4756131
# Assortativity for qualitative attributes
assortativity(karate, values = V(karate)$color, directed=FALSE)
[1] 0.7434211

Here we find the karate network to be assortative by faction, meaning that there is not too much mixing between the two support groups.
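The same measure can be applied to the small friendship network from Section 1.2, using the gender attribute we assigned earlier (a sketch; `assortativity_nominal()` expects integer group codes):

```r
gender_codes <- as.integer(as.factor(V(g_undirected)$gender))
assortativity_nominal(g_undirected, types = gender_codes, directed = FALSE)
```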

2.3 Micro level (local level):

Node-level characteristics, such as
  • centrality measures (degree centrality, betweenness centrality, closeness centrality, PageRank)
  • nodes’ degrees

On the micro-level, we are usually interested in the question: Which are the important nodes? In which way are they important?

To answer this question, we could start by looking at the degree of a node, assuming that the ones with the most links are most likely the most important ones. But what if we want to measure another “type of importance”, e.g. whether a node has a bridging function (critical, for example, for information flow, exchange, exposure, etc.)? Such a node may have a very small degree, but if it were gone, the topology of the network would be substantially different.
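This intuition can be illustrated with a toy graph: two cliques joined through a single bridge node. The bridge has the lowest degree but the highest betweenness (a sketch; node numbering is arbitrary):

```r
# Two 4-cliques (nodes 1-4 and 5-8); `+` takes the disjoint union of two graphs
bridge_graph <- make_full_graph(4) + make_full_graph(4)
bridge_graph <- add_vertices(bridge_graph, 1)           # node 9 will be the bridge
bridge_graph <- add_edges(bridge_graph, c(1, 9, 5, 9))  # connect it to both cliques
degree(bridge_graph)       # the bridge has degree 2, all other nodes 3 or 4
betweenness(bridge_graph)  # every path between the cliques runs through node 9
```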

To capture these different qualities, we have different measures at hand (usually summarised as centrality measures). The most important ones are degree centrality, closeness centrality, betweenness centrality, PageRank, and the local clustering coefficient. IGRAPH and SNA have the most important algorithms readily implemented.

# Centrality measures
degree(karate) # degree
   Mr Hi  Actor 2  Actor 3  Actor 4  Actor 5  Actor 6  Actor 7  Actor 8 
      16        9       10        6        3        4        4        4 
 Actor 9 Actor 10 Actor 11 Actor 12 Actor 13 Actor 14 Actor 15 Actor 16 
       5        2        3        1        2        5        2        2 
Actor 17 Actor 18 Actor 19 Actor 20 Actor 21 Actor 22 Actor 23 Actor 24 
       2        2        2        3        2        2        2        5 
Actor 25 Actor 26 Actor 27 Actor 28 Actor 29 Actor 30 Actor 31 Actor 32 
       3        3        2        4        3        4        4        6 
Actor 33   John A 
      12       17 
betweenness(karate) # betweenness centrality 
     Mr Hi    Actor 2    Actor 3    Actor 4    Actor 5    Actor 6    Actor 7 
250.150000  33.800000  36.650000   1.333333   0.500000  15.500000  15.500000 
   Actor 8    Actor 9   Actor 10   Actor 11   Actor 12   Actor 13   Actor 14 
  0.000000  13.100000   7.283333   0.500000   0.000000   0.000000   1.200000 
  Actor 15   Actor 16   Actor 17   Actor 18   Actor 19   Actor 20   Actor 21 
  0.000000   0.000000   0.000000  16.100000   3.000000 127.066667   0.000000 
  Actor 22   Actor 23   Actor 24   Actor 25   Actor 26   Actor 27   Actor 28 
  0.000000   0.000000   1.000000  33.833333   0.500000   0.000000   6.500000 
  Actor 29   Actor 30   Actor 31   Actor 32   Actor 33     John A 
 10.100000   0.000000   3.000000  66.333333  38.133333 209.500000 
#closeness(karate) # closeness centrality
#page_rank(karate)$vector # pagerank
#transitivity(karate, type = "local") # local clustering coefficient

# Assign centrality score as attribute, example here is betweenness centrality
karate |> activate(nodes) |> 
  mutate(degree = centrality_degree(),
         betweenness = centrality_betweenness(),
         closeness = centrality_closeness(),
         pagerank = centrality_pagerank()) -> karate


# Visualisation
hist(V(karate)$betweenness |> as.numeric())

# Normalise values to get a ranking
normalize <- function(x){(x-min(x))/(max(x)-min(x))} # min-max scaling brings everything to the interval [0,1] 

karate |> 
  activate(nodes) |> 
  mutate(
    # note: round() must wrap the whole product; `x * 9 |> round()` would
    # parse as x * round(9), because |> binds tighter than *
    betweenness_index = round(normalize(betweenness) * 9) + 1,
    closeness_index = round(normalize(closeness) * 9) + 1,
    degree_index = round(normalize(degree) * 9) + 1,
    pagerank_index = round(normalize(pagerank) * 9) + 1
         
  ) -> karate

set.seed(42)
karate |> 
  create_layout(layout = 'fr') -> set_layout

p1 <- karate |> 
ggraph(layout = set_layout)+
  geom_edge_link(color = "gray") +
  geom_node_point(aes(color = degree_index), size = 10)+
  scale_color_gradient(low = "lightblue", high = "darkblue",
                       name  = "Index") +
  theme_void()+
  labs(title = "Degree centrality")
Warning: Ignoring `graph` as layout is already calculated
ℹ Pass the calculated layout to the `graph` argument to silence this warning
p2 <- karate |> 
  ggraph(layout = set_layout)+
  geom_edge_link(color = "gray") +
  geom_node_point(aes(color = betweenness_index), size = 10) +
  scale_color_gradient(low = "lightblue", high = "darkblue",
                       name  = "Index") +
  theme_void()+
  labs(title = "Betweenness centrality")
Warning: Ignoring `graph` as layout is already calculated
ℹ Pass the calculated layout to the `graph` argument to silence this warning
p3 <- karate |> 
  ggraph(layout = set_layout)+
  geom_edge_link() +
  geom_node_point(aes(color = closeness_index), size = 10) +
  scale_color_gradient(low = "lightblue", high = "darkblue",
                       name  = "Index") +
  theme_void()+
  labs(title = "Closeness centrality")
Warning: Ignoring `graph` as layout is already calculated
ℹ Pass the calculated layout to the `graph` argument to silence this warning
p4 <- karate |> 
  ggraph(layout = set_layout)+
  geom_edge_link() +
  geom_node_point(aes(color = pagerank_index), size = 10) +
  scale_color_gradient(low = "lightblue", high = "darkblue",
                       name  = "Index") +
  theme_void()+
  labs(title = "Pagerank centrality")
Warning: Ignoring `graph` as layout is already calculated
ℹ Pass the calculated layout to the `graph` argument to silence this warning
p1 + p2 + p3 + p4 + plot_layout(guides='collect') &
  theme(legend.position='bottom')

3 Network thinking

This section aims at stimulating your network thinking. You will discover two network peculiarities that might appear counterintuitive at first glance, but make a lot of sense once you take the networks’ topologies into consideration.

3.1 Majority illusion

In networks where we observe strong homophily, something like the “illusion of a majority” can arise: even if the majority does not hold a certain characteristic, the network can be wired such that most nodes see a majority of their neighbours holding it, and thus believe the opposite of what is globally true.

Let’s look at an easy example. We can use our simple friendship network from above, which is in the variable g_undirected. In this network, nodes represent friends who are assigned one attribute (their gender).

To test for the majority illusion, we need to compare the average of the gender shares across all ego networks to the global gender shares. Hence, we first iterate over all ego networks (one network per friend), remove the ego node from each, and calculate the gender shares a person “sees” in their ego network. Then, we average the perceived gender shares over all nodes. In a second step, we calculate the gender shares in the friendship network from a global perspective.

Let’s extract an ego network for one person, for example, Amaya.

mtx_friends <- g_undirected |> 
  as_adjacency_matrix(
    sparse = igraph_opt("sparsematrices")) |> 
  as.matrix()

genders <- V(g_undirected)$gender
names <- V(g_undirected)$name

ego_networks <- sna::ego.extract(mtx_friends, ego = NULL, neighborhood = c("combined"))
friends_of_ego <- colnames(ego_networks$Amaya)[colnames(ego_networks$Amaya) != "Amaya"]

# get gender shares in ego network
g_undirected |> 
  igraph::delete_vertices(!V(g_undirected)$name %in% friends_of_ego) |> 
  vertex_attr("gender") |> 
  table() |> 
  prop.table()

   d    f    m 
0.25 0.50 0.25 
# calculation of gender shares in terms of local averages
avg_f_gender = 0
avg_d_gender = 0
avg_m_gender = 0

genders <- V(g_undirected)$gender
names <- V(g_undirected)$name

mtx_friends <- g_undirected |> 
  as_adjacency_matrix(sparse = igraph_opt("sparsematrices")) |> 
                        as.matrix()
ego_networks <- sna::ego.extract(mtx_friends, ego = NULL, neighborhood = c("combined"))

for (ego_node in V(g_undirected)$name) {
  
  idx <- which(names == ego_node)
  friends_of_ego <- colnames(ego_networks[[idx]])[colnames(ego_networks[[idx]]) != ego_node]
  
  # get gender shares in ego network
  proportions <- prop.table(table(genders[names %in% friends_of_ego]))
  
  if (!is.na(proportions["d"])) {
    avg_d_gender <- avg_d_gender + proportions[["d"]]
  }
  
  if (!is.na(proportions["f"])) {
    avg_f_gender <- avg_f_gender + proportions[["f"]]
  }

  if (!is.na(proportions["m"])) {
    avg_m_gender <- avg_m_gender + proportions[["m"]]
  }
}

# Extract results
cat(paste("The average node thinks that", round(100 * avg_f_gender / length(names)), "% of the network is female,",
          round(100 * avg_d_gender / length(names)), "% diverse, and",
          round(100 * avg_m_gender / length(names)), "% male."))
The average node thinks that 68 % of the network is female, 7 % diverse, and 25 % male.
# calculation of the gender shares from a global perspective
global_absolut_f = 0
global_absolut_d = 0
global_absolut_m = 0

for (node in V(g_undirected)$name) { # Iterate over all nodes
  if (V(g_undirected)[node]$gender == "f") { # Count the number of female nodes
      global_absolut_f <- global_absolut_f + 1
      }
  if (V(g_undirected)[node]$gender == "d") {  # Count the number of diverse nodes
      global_absolut_d <- global_absolut_d + 1
      }
  if (V(g_undirected)[node]$gender == "m") {  # Count the number of male nodes
      global_absolut_m <- global_absolut_m + 1
      }
  }


# Extract results
cat(paste("From a global perspective,", round(100 * global_absolut_f / length(names)), "% of the network is female,",
          round(100 * global_absolut_d / length(names)), "% diverse, and",
          round(100 * global_absolut_m / length(names)), "% male."))
From a global perspective, 43 % of the network is female, 14 % diverse, and 43 % male.

3.1.1 Exercise 1:

What do you observe? Why do you see what you see?

YOUR COMMENTS HERE

3.1.2 Exercise 2 (Extra): Reflections on the majority illusion

  1. What implications does this phenomenon, in your opinion, have for social cohesion? For “public opinion”? For political influence? …?

  2. The majority illusion is only one peculiarity. Another striking case driven by homophily that I want to bring to your attention is that even mild preferences often result in very strict homophilic patterns (along the lines of “the aggregate is more than the sum (of individual-level preferences)”). One of the most famous examples to illustrate this is segregation (see the work of the Nobel Prize-winning game theorist Thomas Schelling, especially his 1971 paper Dynamic models of segregation). You can play an interactive version of the paper’s content here: Parable of Polygons. Even if people are happy being in the minority, the groups end up segregated. Comment: There is no network in this interactive post, but one could easily transfer the idea to a friendship network where people are allowed to rewire their connections.
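The Schelling mechanism can be sketched in a few lines of base R. The following is a minimal, illustrative 1-D version (all parameters — grid size, group sizes, the 1/3 contentment threshold, the neighbourhood radius — are made up): agents with a deliberately mild preference repeatedly move to random empty cells until everyone is content.

```r
# Minimal 1-D Schelling sketch (base R; all parameters are illustrative)
set.seed(1)
cells <- sample(c(rep(1, 20), rep(2, 20), rep(0, 10)))  # two groups, 0 = empty

# An agent is content if at least 1/3 of its occupied neighbours (within
# distance 2) belong to its own group -- a mild preference
content <- function(cells, i) {
  nb <- cells[setdiff(max(1, i - 2):min(length(cells), i + 2), i)]
  nb <- nb[nb != 0]
  length(nb) == 0 || mean(nb == cells[i]) >= 1/3
}

for (step in 1:500) {
  occupied <- which(cells != 0)
  unhappy  <- occupied[!vapply(occupied, content, logical(1), cells = cells)]
  if (length(unhappy) == 0) break                    # everyone is content
  mover <- unhappy[sample.int(length(unhappy), 1)]   # random unhappy agent
  empty <- which(cells == 0)
  dest  <- empty[sample.int(length(empty), 1)]       # moves to a random empty cell
  cells[dest]  <- cells[mover]
  cells[mover] <- 0
}

# Segregation: share of adjacent occupied pairs that are same-group
same <- function(cells) {
  pairs <- cbind(cells[-length(cells)], cells[-1])
  pairs <- pairs[pairs[, 1] != 0 & pairs[, 2] != 0, , drop = FALSE]
  mean(pairs[, 1] == pairs[, 2])
}
same(cells)
```

Running this typically yields a same-group share well above what the mild 1/3 preference would suggest — the aggregate is more than the sum of individual-level preferences.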

YOUR COMMENTS HERE

3.1.3 Exercise 3: Blog posts

For this exercise, we will use some (a bit more exciting) real-world data: the dataset from this article. The dataset contains front-page hyperlinks (edges) between blogs (nodes) in the context of the 2004 US election (a directed network). The dataset is available here.

  1. Load the data (edges stored in edges_blogs.txt) and assign the labels (stored in node_labels_blogs.txt) to every node.

  2. Repeat the steps as in Exercise 1. Do you find a majority illusion here, too?

df1_graph <- read.table(
  here("data", "edges_blogs.txt"), 
  sep="\t", 
  header=FALSE)
df2_nodes_attributes <- read.table(
  here("data", "node_labels_blogs.txt"),
  sep="\t", 
  header=FALSE)

colnames(df1_graph) <- c("node1", "node2")
colnames(df2_nodes_attributes) <- c("node", "attribute")


g_blogs <- graph_from_data_frame(df1_graph, directed=FALSE) # simplification: treat the hyperlink network as undirected
g_blogs <- set_vertex_attr(g_blogs, "leaning", index = V(g_blogs),
                           df2_nodes_attributes$attribute)
ecount(g_blogs)
[1] 16715
vcount(g_blogs)
[1] 1224
edge_density(g_blogs)
[1] 0.02233205
transitivity(g_blogs)
[1] 0.2259585
is_directed(g_blogs)
[1] FALSE
node_names <- V(g_blogs)$name

avg_right_leaning_tidy <- node_names |>
  purrr::map_dbl(function(current_ego_name) {
    # 1. Get the names of the alters (neighbors) for the current_ego_name
    alter_node_names <- names(neighbors(g_blogs, current_ego_name))

    # If the ego node has no alters, its contribution to the sum is 0
    if (length(alter_node_names) == 0) {
      return(0.0)
    }

    # 2. Filter 'df2_nodes_attributes' for these alters and get their attributes.
    #    We also filter out NA attributes to match the default behavior of `table()`.
    alters_attributes_df <- df2_nodes_attributes |>
      filter(as.character(node) %in% alter_node_names) |>
      filter(!is.na(attribute)) # Exclude rows where the attribute itself is NA

    # If no alters are found in the attributes dataframe or all their attributes are NA, contribution is 0
    if (nrow(alters_attributes_df) == 0) {
      return(0.0)
    }

    # 3. Calculate the proportions of each attribute among the alters
    proportions_summary <- alters_attributes_df |>
      count(attribute, name = "n_val") |>          # Count occurrences of each attribute
      mutate(proportion = n_val / sum(n_val))   # Calculate proportion for each attribute

    # 4. Extract the proportion for the "right-leaning" attribute
    right_leaning_prop_value <- proportions_summary |>
      filter(attribute == "right-leaning") |>
      pull(proportion)                            # Extract the proportion value

    # If "right-leaning" attribute was not found among alters, its proportion is 0
    if (length(right_leaning_prop_value) == 0) {
      return(0.0)
    } else {
      # pull() might return multiple values if 'attribute' wasn't unique after count (not the case here)
      # We take the first (and should be only) value.
      return(right_leaning_prop_value[1])
    }
  }) |>
  sum(na.rm = TRUE)

# Extract results
cat(paste("The average node thinks that", round(100 * avg_right_leaning_tidy / vcount(g_blogs), 2), "% of the network is right-leaning."))
The average node thinks that 53.61 % of the network is right-leaning.
# global perspective 
prop.table(table(V(g_blogs)$leaning))

 left-leaning right-leaning 
    0.4803922     0.5196078 
cat(paste("From a global perspective,", round(prop.table(table(V(g_blogs)$leaning))[["right-leaning"]]*100, 2), "% of the nodes are right-leaning."))
From a global perspective, 51.96 % of the nodes are right-leaning.

Even though the two political blog communities are clearly divided, the majority illusion does not seem to be present in this case: the averaged ego perspective and the global network perspective lead to similar results with respect to the share of nodes that are right-leaning.

However, the high attribute assortativity coefficient indicates that blogs tend to reference blogs like themselves with regard to political orientation.
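The assortativity coefficient itself is not computed above; with igraph it would be something like assortativity_nominal(g_blogs, as.factor(V(g_blogs)$leaning)). As a sketch of what that coefficient measures, Newman’s nominal assortativity can be computed from a mixing matrix in base R (the toy matrix below is illustrative, not the blog data):

```r
# Newman's nominal assortativity from a mixing matrix
# (base R sketch; the toy matrix is illustrative, not the blog data)
assortativity_from_mixing <- function(e) {
  e <- e / sum(e)                    # normalise to edge fractions
  a <- rowSums(e)                    # marginal shares per group
  b <- colSums(e)
  (sum(diag(e)) - sum(a * b)) / (1 - sum(a * b))
}

# Strong within-group mixing: 90% of edges stay inside the groups
e <- matrix(c(45, 5,
               5, 45), nrow = 2, byrow = TRUE)
assortativity_from_mixing(e)  # 0.8 -> strongly assortative
```

A value of 0 would mean mixing proportional to group sizes; values near 1 indicate that edges overwhelmingly stay within groups, as in the blog network.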

3.2 The friendship paradox

3.2.1 Exercise 4:

It’s time for the next peculiarity of network effects: The friendship paradox.

For this exercise, we will use the data from the following paper: Viswanath, B., Mislove, A., Cha, M., & Gummadi, K. P. (2009). On the evolution of user interaction in Facebook. Proceedings of the 2nd ACM Workshop on Online Social Networks, 37–42.

You can find it in the data folder under the name data_facebook.txt.

  1. Read in the data, and construct your graph.

  2. This time, we are not interested in assortativity by age, but rather in assortativity by number of friends (i.e. degree assortativity). Generate a plot that shows the average number of friends of a person’s friends against the person’s own number of friends. Add the identity line (the line through (0,0) and (1,1)).

  3. Nodes above the identity line have fewer friends than their average neighbour, while nodes below it have more. The friendship paradox states that most nodes have fewer friends than their friends have on average. Can you check whether the friendship paradox is also at play in the Facebook network? (Hint: count the number of nodes above and below the identity line and compare the sizes of the two groups.)

  4. Calculate the degree assortativity.

df2_graph <- read.table(
                        here("data", "data_facebook.txt"), 
                        sep=" ", header=FALSE)
colnames(df2_graph) <- c("node1", "node2")

g_facebook <- graph_from_data_frame(
  df2_graph, directed=FALSE) |> 
  as_tbl_graph()

vcount(g_facebook)
[1] 63731
ecount(g_facebook)
[1] 817035
# Get the degrees of the nodes
g_facebook |> 
  activate(nodes) |>
  mutate(degree = centrality_degree()) -> g_facebook

neighbors_df <- g_facebook |> 
  activate(edges) |> 
  mutate(ego_degree = .N()$degree[from],
         alter_degree = .N()$degree[to])

neighbors_df |> 
  as_tibble() |>
  rename(ego = from, 
         alters = to) -> merged_df

# calculate how many friends and ego's friends have on average
avg_friends_df <- merged_df |>  
  group_by(ego) |> 
  summarise(ego_degree = first(ego_degree),  # constant within each ego
            avg_degree_neighbor = mean(alter_degree)) |> 
  na.omit()

# Plot the data
avg_friends_df |> 
  slice_sample(n = 2000) |>
  ggplot(aes(x = ego_degree, y = avg_degree_neighbor)) +
  geom_abline(slope = 1, 
              intercept = 0, color = "darkgreen", linetype = "dashed") +
    stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", alpha = 1)+
  geom_point(fill = NA, size = 3, alpha = 0.1, color = "black", 
             shape = 21, stroke = 1) +
  scale_x_log10(limits = c(1, 1000)) +
  scale_y_log10(limits = c(1, 250)) +
  labs(x = "Number of friends", 
       y = "Number of friends of the average neighbour",
       title = "Number of friendships",
      caption  = "Sample of 2000") +
  theme_minimal()+
  scale_fill_continuous(
    limits = c(0,1), 
    breaks = c(0, 0.25, 0.5, 0.75, 1),
    type = "viridis", 
    name = "Density",
    guide = guide_colourbar(direction = "horizontal")
    )+
  theme(legend.position = "inside",
        legend.position.inside = c(0.7, 0.2),
        legend.title.position = "top",
        legend.ticks = element_blank())

# Count the number of nodes above and below the identity line
paradox_df <- merged_df |>   
  group_by(ego) |> 
  summarise(avg_degree_neighbor = mean(alter_degree), 
            ego_degree = mean(ego_degree))

paradox_df <- paradox_df |> 
  mutate(ego_more_than_alters = if_else(ego_degree > avg_degree_neighbor, "yes", "no"))

proportions <- prop.table(table(paradox_df$ego_more_than_alters))

# Check for the friendship paradox
if (proportions["no"] > proportions["yes"]) {
  print("We found a friendship paradox.")
} else {
  print("We did not observe a friendship paradox.")
}
[1] "We found a friendship paradox."

The friendship paradox is the observation that the degrees of a node’s neighbours are, on average, greater than the degree of the node itself. In other words, your friends have more friends than you do. We see evidence for this in the data: if there were no degree assortativity in the network, points in the plot would scatter around the identity line. Instead, most points in the low and intermediate range (people with one to around 80 friends) lie above it: these people tend to be connected to individuals who have more friends than they do. The most popular individuals in the network (100 friends and more) tend to be connected to individuals who have fewer friends; in other words, they have many ‘followers’ who are not so well connected.
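Step 4 (the degree assortativity) is not computed above; with igraph it is assortativity_degree(g_facebook). Conceptually, it is the Pearson correlation of the degrees at the two ends of each edge, which can be sketched in base R on a toy edge list (the edges below are made up, not the Facebook data):

```r
# Degree assortativity as the correlation of endpoint degrees
# (base R sketch on a hypothetical 4-node toy graph)
el  <- rbind(c(1, 2), c(1, 3), c(1, 4), c(2, 3))   # toy undirected edge list
deg <- tabulate(c(el[, 1], el[, 2]), nbins = 4)    # degree of each node

# count every undirected edge in both directions
x <- deg[c(el[, 1], el[, 2])]
y <- deg[c(el[, 2], el[, 1])]
cor(x, y)  # about -0.73: hub-dominated graphs are degree-disassortative
```

A negative value mirrors the pattern in the plot above: low-degree nodes attach to hubs, so degrees at the two ends of an edge tend to move in opposite directions.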